Chained operation of functional units in integrated circuit by writing DONE/complete value and by reading as GO/start value from same memory location

ABSTRACT

In an embodiment, the present invention discloses a flexible and reconfigurable architecture with efficient memory data management, together with efficient data transfer and relieving data transfer congestion in an integrated circuit. In an embodiment, the output of a first functional component is stored to an input memory of a next functional component. Thus when the first functional component completes its processing, its output is ready to be accessed as input to the next functional component. In an embodiment, the memory device further comprises a partition mechanism for simultaneously accepting output writing from the first functional component and accepting input reading from the second functional component. In another embodiment, the present integrated circuit comprises at least two functional components and at least two memory devices, together with a controller for switching the connections between the functional components and the memory devices. The controller can comprise a multiplexer or a switching matrix.

This application claims priority from U.S. provisional patent application Ser. No. 60/974,451, filed on Sep. 22, 2007, entitled “Soft-reconfigurable massively parallel architecture and programming system”, which is incorporated herein by reference. This application is related to, and co-pending with, U.S. patent applications entitled “Soft-reconfigurable massively parallel architecture and programming system”, attorney docket numbers NAV001A and NAV001B, and to U.S. patent applications entitled “Re-configurable bus fabric for integrated circuit”, attorney docket number NAV003.

FIELD OF THE INVENTION

The present invention relates to apparatuses and methods for integrated circuits, and more particularly to hardware and software system design and parallel processing architecture and programming system.

BACKGROUND OF THE INVENTION

Everywhere in communication systems, increasingly sophisticated algorithms are being used to support higher data rates and richer services. This is true in all application areas, but perhaps most visibly in mobile and video segments, where the move to a new generation is driving significant changes in component design for telecoms equipment and multimedia video equipment, such as multi stream/channel based real-time video surveillance equipment where intelligent inline/in-situ decisions have to be made. In addition to basic voice and messaging, UMTS paves the way for telecom operators, and now WIMAX based open systems, and possibly open spectrum such as 700 MHz in the US, to offer sophisticated data oriented services that industry analysts predict are essential for revenue growth over the next decade.

As people strive for higher data rates or longer reach over fixed channels, data rates get ever-closer to Shannon's limit and more sophisticated algorithms are required. Indeed, the requirement for signal processing is rising ten to a hundred times faster than Moore's law can deliver.

Estimation and detection algorithms in today's communication systems require the number of operations per second to grow by a factor of ten every four years; that compares to the increase in processor speed from Moore's law of a factor of ten every six years. Worse, while Moore's law holds well for general purpose processors and memory, the difficulty of integrating ever bigger systems means that the growth curve for complex System-on-a-chip (“SoC”) ASICs is significantly slower (“the design gap”), with a CAGR of 22%.

Not only must equipment deliver improved performance, but design times are under pressure and budgets are stressed, often in an environment where standards are shifting. For example, WiMax started out in 2001 (IEEE 802.16d) with a stationary, network based wireless vision, was transformed in 2006 into a mobile standard (IEEE 802.16e), and is now being transformed further to support wide spectrum in the FDD and TDD domains for more spectrally efficient transmission of data, video, and voice (802.16m).

A fundamental change in approach is required, and there is a growing awareness of the attractiveness of reconfigurable DSP, flexible architectures or other (SDR) systems. Makimoto's wave would suggest such a transition is overdue, with the most desirable characteristics of these techniques including “efficient”, “optimal” or “cost effective”.

SUMMARY

In an embodiment, the present invention discloses a flexible and reconfigurable architecture for microelectronic processing units. This architecture offers efficient memory data management, together with efficient data transfer and relieving data transfer congestion in an integrated circuit.

The integrated circuit according to embodiments of the present invention can include a plurality of functional components, which typically comprise a group of devices for performing a set of logical processing, such as a logic design module, a coprocessor, an ALU, a logic design having a plurality of RTL code lines, or an IP block. The integrated circuit can include memory devices to accommodate data passing between the functional components. The functional components can read and write data to memory devices, and the memory data can pass from one location to another location so that the functional components can access it. In a preferred embodiment, to minimize data transfer, the memory can be arranged so that a functional component can write to a memory block that will be accessed by the next functional component. Thus when processing passes to the next functional component, the input data is readily available without any data transfer.

In an embodiment, the output of a first functional component is stored to an input memory of a next functional component. Thus when the first functional component completes its processing, its output is ready to be accessed as input to the next functional component. Using this arrangement, data transfer can be minimized, thus relieving data congestion in an integrated circuit.

In an embodiment, the present integrated circuit comprises at least a memory device and at least two functional components. The functional components are chained, preferably by software configuration, so that a second functional component starts after the completion of a first functional component. Further, the memory device is configured to receive the data output from the first functional component and to provide input data to the second functional component. The present integrated circuit can minimize data transfer during the chain of processing of the two functional components since the input data for the second functional component is readily available after the completion of the first functional component.

In an embodiment, the memory device further comprises a partition mechanism for simultaneously accepting output writing from the first functional component and accepting input reading from the second functional component. For example, after the first functional component passes the data to the second functional component, it starts generating the next set of output data. Thus the memory device is configured to store the new set of data generated from the first functional component, and to supply the old set of data as input to the second functional component.

In an embodiment, the memory device comprises at least two memory portions, one for output writing and one for input reading. The memory device can further comprise a controller with at least two states to switch between the two memory portions. For example, in a first state, the controller selects a first portion to receive the data output from the first functional component, and selects a second portion to provide data input to the second functional component. When the functional components complete processing, the controller switches states. In the second state, the controller selects the second portion to receive data output and the first portion to provide data input. The two functional components can run indefinitely without any memory data transfer.
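The two-portion arrangement described above can be sketched in software as a simple behavioral model. The sketch below is illustrative only: the class `PingPongMemory`, the function `run_stage_pair`, and the producer and consumer callables are hypothetical names introduced here, and the write and read steps that would overlap in hardware are executed one after the other.

```python
class PingPongMemory:
    """Two memory portions plus a two-state controller (hypothetical model)."""

    def __init__(self):
        self.portions = [[], []]  # portion 0 and portion 1
        self.state = 0            # state 0: write portion 0, read portion 1

    def write_portion(self):
        # Portion currently selected to receive the first component's output.
        return self.portions[self.state]

    def read_portion(self):
        # Portion currently selected to feed the second component's input.
        return self.portions[1 - self.state]

    def switch(self):
        # The controller changes state once both components have completed.
        self.state = 1 - self.state


def run_stage_pair(blocks, producer, consumer):
    """Run a producer and a consumer over successive blocks of data."""
    mem = PingPongMemory()
    results = []
    # One extra pass with None lets the consumer drain the final block.
    for block in list(blocks) + [None]:
        old = mem.read_portion()
        if old:
            results.append(consumer(list(old)))  # consumer reads the old set
        new = mem.write_portion()
        new.clear()
        if block is not None:
            new.extend(producer(block))          # producer writes the new set
        mem.switch()                             # no data is copied, only roles swap
    return results


doubled_sums = run_stage_pair([[1, 2], [3, 4]],
                              lambda b: [2 * x for x in b],
                              sum)
```

Only the controller state changes between passes; the data written by the producer stays where it is and is read in place by the consumer.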

In an embodiment, there is a plurality of functional components in a chain of processing, together with a plurality of memory devices to provide input and output data. The memory devices are configured to output and input data to the chain of functional components with minimum data transfer. For example, a first functional component writes to a first memory device. When it completes, the first memory device supplies the data as input to a second functional component. The second functional component writes its output data to a second memory device, which then serves as input data to a third functional component.

In an embodiment, the present integrated circuit comprises at least two functional components and at least two memory devices, together with a controller having at least two states. The functional components can be connected to the memory devices through the controller. The controller then can select the memory devices to be connected to the functional components, depending on the states of the controller. For example, in a first state, the first/second functional component is connected to the first/second memory device respectively. In a second state, the first/second functional component is connected to the second/first memory device respectively. In an aspect, when the functional components complete processing, the controller switches states. The functional components are also preferably chained together for serial processing. The memory devices can serve as input and output for the functional components, and are preferably configured to minimize data transfer. For example, the first memory device first receives output data from the first functional component, then the controller switches state so that the first memory device is now connected to the second functional component to provide input data.

The controller can comprise a multiplexer or a switching matrix. Also, the integrated circuit can comprise more than two functional components and more than two memory devices. The controller can also perform any connections between the functional components and the memory devices, and thus can accommodate any chaining configuration of the functional components.

In an embodiment, the functional components can be chained together, preferably by software, so that at least one functional component starts after the completion of at least another functional component. For example, the functional component can comprise two control components, a GO component to start the devices and a DONE component to identify the completion, and an optional GO_OFF component to indicate that the device is busy processing. In an embodiment, the control components (GO, DONE, or GO_OFF) are registers for storing the state of the control components.

The chain of functional components can operate without the interaction of the processing unit. At the end of the chain, an interrupt can be raised to get the attention of the processing unit. After chaining, a plurality of functional components can run independent of the processing unit, and can require intervention of the processing unit only at the end of the chain. The configuration of the chain can be series, parallel, and any combination thereof, arranged to meet the circuit's objective. A plurality of chains might be configured, for example, for parallel processing, and also for cross data passing between chains. The chaining can be configured and re-configured, preferably by software input. For example, the chaining can be performed by a processing unit or by register writing. The chaining can also be performed at design time or at run time. The chaining can also be modified, preferably at design time, but can also be modified at run time.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of an exemplary functional component.

FIG. 2 illustrates a schematic of an exemplary functional component.

FIG. 3 illustrates an exemplary connection of functional components.

FIG. 4 illustrates an exemplary chaining methodology for connecting functional components.

FIG. 5 shows a preferred embodiment of a Functional Component.

FIG. 6 illustrates an exemplary flowchart for system operation.

FIG. 7 shows an exemplary chain process for a plurality of functional components.

FIG. 8 illustrates an exemplary configuration of slice arrangement.

FIG. 9 shows an exemplary SOC architecture, comprising a CPU and a functional structure (FS) coprocessor.

FIG. 10 illustrates an exemplary system configuration with a plurality of slices.

FIG. 11 illustrates another exemplary architecture with slices.

FIG. 12 illustrates an exemplary feedback loop to prepare implementations.

FIG. 13 illustrates an exemplary process for mapping applications to existing implementations.

FIG. 14 illustrates an exemplary hardware/software stack according to embodiments of the present invention.

FIG. 15 shows an exemplary floorplan with slices and bands.

FIG. 16 illustrates an exemplary local bus configuration.

FIG. 17 illustrates an exemplary local arbiter configuration.

FIG. 18 illustrates an exemplary embodiment comprising local memory bus and arbiter configuration.

FIG. 19 shows another embodiment of a slice configuration with functional components, memories and arbiters.

FIG. 20 shows an exemplary embodiment of memories and functional component distribution for reducing memory congestion.

FIG. 21 illustrates a slice configuration with memory arbiter and local memory bus.

FIG. 22 illustrates an exemplary arbiter configuration for a plurality of slices and IP block.

FIG. 23 illustrates an embodiment where various functional components are arranged in a slice.

FIG. 24 illustrates an exemplary band configuration for a plurality of slices and IP block.

FIG. 25 shows an exemplary system configuration.

FIG. 26 illustrates an exemplary computer system which can be used in the present invention.

FIG. 27 illustrates a schematic block diagram of a sample computing environment.

FIG. 28 illustrates an exemplary memory sharing configuration between two functional components to reduce memory data transfer.

FIG. 29 illustrates an exemplary memory sharing configuration between three functional components.

FIG. 30 illustrates an exemplary memory sharing configuration where the memory is partitioned into two portions.

FIG. 31 illustrates an exemplary memory configuration with a mux controller to control the connections between the functional components and the memory.

FIG. 32 illustrates an exemplary memory sharing configuration between a plurality of functional components and a plurality of memory devices.

FIGS. 33A and 33B illustrate two states of a MUX controller for switching connections between functional components and memory devices.

FIG. 34 illustrates another exemplary memory sharing configuration between a plurality of functional components and a plurality of memory devices.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Acronyms:

ASIC Application Specific Integrated Circuit

CAGR Compound Annual Growth Rate

CCB Component Control Block

CPU Central Processing Unit

DDI Digital Design Implementation

DSP Digital Signal Processing

FB Functional Block

FC Functional Component

FCB Flow Control Block

FDD Frequency Division Duplexing

FFT Fast Fourier Transform Block

FIR Finite Impulse Response Filter Block

FPGA Field-Programmable Gate Array

IC Integrated Circuit

IP Intellectual Property embodied in a circuit

LCCB Local Component Control Block

MIMD Multiple-Instruction, Multiple Data

MISD Multiple Instruction, Single Data

MMU Memory Management Unit

OS Operating System

RDL Register Definition Language

SIMD Single Instruction, Multiple Data

SoC System on a Chip

SDR Software Defined Radio

TDD Time Division Duplexing

TDDM Time Division Demultiplexer

TDM Time Division Multiplexer

UMTS Universal Mobile Telecommunications System

WIMAX Worldwide Interoperability for Microwave Access

In some embodiments, this patent discloses a flexible and reconfigurable architecture for processing units such as processors, microprocessors, controllers and embedded controllers to address the rapid development and shorter cycle of the products. This architecture offers soft-configurability and soft-reconfigurability to accommodate a variety of different product families, together with high performance in the form of massive parallelism and high flexibility where the processing units are soft-programmed to perform different tasks. The present architecture also addresses control congestion by delegating a large number of CPU decisions to its slaves, and addresses memory bus congestion with interspersed local memories. The present architecture relieves the dependency on the CPU for faster execution, providing a new framework for a massively parallel computational system to improve efficiency and performance. In the system, most tasks are to be processed on the independent multiple slice subsystems so that the dependency on the CPU decreases significantly.

The present device architecture provides real time signal processing capability with internal reconfigurability suitable for handling high bandwidth digital signal formats such as compressed video, audio, compact disk, digital versatile disc and mixed mode. The architecture of the present system provides DSP inherent high computational processing capability for dynamic video signals with high overall system bandwidth. The system also addresses data processing applications requiring a large number of operations, such as digital signal processing, image processing, pattern recognition, and neural network algorithms.

In some embodiments the present invention comprises a powerful and flexible massively parallel system architecture, a software infrastructure, and the complementing programming and software model. The architecture pertains to IC design, such as using configurable building block functions to accomplish custom functions. In some embodiments specific designs and applications disclosed in this application are implemented on FPGA, especially for DSP (digital signal processing) and image processing. But the present invention has application in many environments such as DSP (digital signal processing), image processing, and other multimedia applications, such as audio and image compression and decompression, code encryption and voice/image recognition, and telecommunications.

The present system provides a flexible computer architecture that in different embodiments is programmed in a wide variety of ways to perform a wide variety of applications. The present system is especially suited to be programmed to function as a parallel processor. The slices are programmed to function as a matrix of processing functional blocks, performing the same operations on different data in parallel. This case allows the present system to operate as a SIMD processor. In some embodiments the slices correspond to different programs, operating as a MIMD or MISD processor. In other embodiments the system also operates as a SISD serial processor.

In embodiments the system provides two components to accelerate system design: a highly flexible, reconfigurable architecture, and a design methodology that is compatible with this architecture and maximally utilizes it to achieve huge performance at an affordable price. The present system provides extreme configurability, in that in some embodiments different applications map onto a given design without changing it; low power consumption, in that power optimization abilities are incorporated into the architecture itself; and a methodology that a normal engineering team can pick up and use with minimal effort. For example, to the end-user engineer, an implementation of the present architecture appears as C-language function calls to the peripherals.

The system provides a method for building an embedded system where the software configures how the IC components communicate with each other and with the software, enabling the overall system to perform many different tasks. In some preferred embodiments the IC components perform their individual tasks with little or no involvement by the software.

The present architecture provides reconfigurable mixed analog and digital signal building block functions to accomplish custom functions. This is useful since software is easier to develop, debug, and modify as compared to hardware system design, which is a difficult, time-consuming task with long turn-around time and a long product cycle. Embodiments of the architecture provide an embedded system with a high flexibility where the software reconfigures the IC components at any time. Thus the present system comprises a hardware implementation that is very flexible and is reused by an entire family of applications. For example, a single embodiment supports a family of DSP applications, while another embodiment supports most image-processing applications.

1. Architecture to Address Control Congestion

In some embodiments a building block for the system is a functional component (FC), comprising a functional block (FB) and a flow control block (FCB). The FB is a group of devices for performing a set of logical processing, such as a logic design module, a coprocessor, an ALU, or a logic design having a plurality of RTL code lines. The FCB comprises controllable start and stop functionality for the functional block.

As one example, the FBs contain phase locked loop (PLL) blocks, macro blocks, operational amplifiers, comparators, analog multiplexers, analog switches, voltage/current reference, switched capacitor filters, gm/C filters, data converters, communication blocks, clock generation blocks, customizable input/output blocks, fixed design input/output blocks, and processor blocks.

In some embodiments the FCB starts the FB when the FCB detects a start signal. When the FB completes operation it informs the FCB it has halted. Then the FB halts until the flow control block starts it again. In some embodiments FBs sit adjacent to local memories. Here the FB receives its input from some local memory, and writes its output to other local memory, based on the address where the data is stored. In some embodiments the FCB behaves like a software-controlled switch, to turn on and off the FB.

In some embodiments the FB can be an IP block. The FCB has a done flag, to signify that the functional block has completed its operation. The FCB has a next flag switch, to identify the next FB to activate.

FIG. 1 illustrates an exemplary functional component 10, comprising FB 11 in connection with a FCB 12 through a functional control data path 14. The FB 11 may communicate with other components or devices such as a memory block (not shown) through a data path 13, for example, to transfer data. The FCB 12 communicates with other devices or components through the flow control path 15, to receive external command or to send data.

FIG. 2 shows a FC 20, comprising a FB 21 in connection with a FCB 22. Data is transferred from or to the FB 21 through the data path 23. The FCB 22 sends signal 24A to start the FB 21, and the FB 21 sends signal 24B to identify the completion of the processing. The FCB comprises a GO component 12A and a DONE component 12C. When the GO component 12A is set, e.g. having a value of 1, it starts the FB 21 by sending a start command through the signal path 24A. In some embodiments FCB 22 comprises a GO component 12A that starts FB 21, and thus starts processing, when the GO component changes its value. In some embodiments the GO component is an address of a register (or memory) in a Component Control Block (CCB, not shown). When the GO CCB data changes, the FCB recognizes the change and starts processing.

In some embodiments, after the GO command starts its processing, the FB 21 resets the GO component 12A and stops monitoring the GO component 12A until after it finishes processing. In other embodiments the flow control block 22 comprises a GO_OFF component 12B that identifies that the FB is still processing and thus not available for taking a new command.

In some embodiments the GO_OFF component is an address of a register (or memory) in a Component Control Block (CCB). When the FB 21 starts processing, it changes the GO_OFF CCB data to identify that the FB is busy processing and thus not available. If the GO component is set while the FB is busy, the start waits until the GO_OFF signal clears before the FB 21 can start processing again.

In some embodiments the FB 21 also resets the DONE component 12C to identify that it has started processing, and sets the DONE component when it finishes processing. When the DONE component 12C is set, e.g. having a value of 1, this signifies that the FB 21 has completed its processing. In some embodiments, after completing processing, the FB sends a DONE signal through signal 24B to the FCB 22 to set the DONE component 12C. In some embodiments the DONE component 12C is a memory-mapped register (or memory) in a Component Control Block (CCB). When the DONE CCB data changes, other devices or blocks recognize that the functional block 21 has finished processing.
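The GO, GO_OFF and DONE handshake described in the preceding paragraphs can be sketched as a small behavioral model. This is a sketch under stated assumptions, not the disclosed circuit: the class `FlowControlBlock`, its method names, and the list-based CCB are hypothetical, and the polarity of the busy flag (1 meaning busy) is an assumption made for illustration.

```python
class FlowControlBlock:
    """Hypothetical model of an FCB whose GO, GO_OFF and DONE are CCB addresses."""

    def __init__(self, ccb, go_addr, go_off_addr, done_addr, fb):
        self.ccb = ccb                  # shared list of CCB bits
        self.go_addr = go_addr
        self.go_off_addr = go_off_addr
        self.done_addr = done_addr
        self.fb = fb                    # callable standing in for the FB's work

    def try_start(self):
        """Begin a run if GO is set and the FB is idle; return True if started."""
        if self.ccb[self.go_addr] != 1:
            return False                # no start request
        if self.ccb[self.go_off_addr] == 1:
            return False                # busy (assumed polarity): request waits
        self.ccb[self.go_addr] = 0      # FB resets GO once processing starts
        self.ccb[self.go_off_addr] = 1  # mark the FB as busy
        self.ccb[self.done_addr] = 0    # DONE is reset at the start of a run
        return True

    def finish(self):
        """Complete the run: do the work, set DONE, clear the busy flag."""
        self.fb()
        self.ccb[self.done_addr] = 1
        self.ccb[self.go_off_addr] = 0


GO, GO_OFF, DONE = 1, 2, 3              # hypothetical CCB addresses
ccb = [0, 0, 0, 0]
runs = []
fcb = FlowControlBlock(ccb, GO, GO_OFF, DONE, lambda: runs.append("run"))

ccb[GO] = 1                             # an external agent requests a start
first_start = fcb.try_start()           # accepted: FB was idle
ccb[GO] = 1                             # a second request arrives while busy
blocked = not fcb.try_start()           # held off until GO_OFF clears
fcb.finish()
second_start = fcb.try_start()          # the pending request is now accepted
```

Splitting the run into `try_start` and `finish` is a modeling choice that makes the busy window visible; in hardware the FB would run between those two events.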

These particular embodiments are just exemplary embodiments, and skilled persons versed in the art will recognize that there are alternative ways to practice the FCB to control the FB.

In some embodiments the GO, GO_OFF and DONE components each include more than one element, linked together through AND or OR gates. For instance, in some embodiments there are 4 registers for each of the components. In some embodiments the four GO components are connected with an OR gate; in other words, there are 4 ways to start the FB 21, by setting any one of the GO components. In some embodiments the four GO components are connected with an AND gate, meaning all four GO components have to be set before the FB can start. Here the AND connection provides a synchronization feature, allowing the FB to wait for the four conditions to be satisfied before it starts processing. In other embodiments the various GO components are connected in various logical fashions, allowing for a variety of scenarios.
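As a minimal illustration of the OR and AND start conditions, the function below evaluates several GO addresses against a table of CCB bits. The function name `start_condition`, the `mode` parameter, and the addresses used are hypothetical and serve only to show the two gating behaviors.

```python
def start_condition(ccb, go_addrs, mode="OR"):
    """Evaluate a start condition over several GO addresses (illustrative)."""
    bits = [ccb[addr] for addr in go_addrs]
    if mode == "OR":
        return any(bit == 1 for bit in bits)   # any one GO bit starts the FB
    if mode == "AND":
        return all(bit == 1 for bit in bits)   # all GO bits must be set
    raise ValueError("mode must be 'OR' or 'AND'")


# Hypothetical table with four GO bits at addresses 2..5; only one is set.
ccb = [0, 1, 1, 0, 0, 0]
or_ready = start_condition(ccb, [2, 3, 4, 5], "OR")
and_ready = start_condition(ccb, [2, 3, 4, 5], "AND")
```

With a single bit set, the OR-connected FB may start while the AND-connected FB keeps waiting, which is the synchronization behavior described above.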

In some embodiments each DONE component comprises more than one register. The architecture uses the DONE components to signal the completion of the present FB, which then signals the start of another FB. In some embodiments multiple DONE components allow the chaining of multiple other FBs after the completion of the current one.

Further, in some embodiments the FBs are chained to each other, effectively creating a flow of operation linking multiple FBs. In some embodiments the connections are designed with software at design time, linking the function blocks to perform the desired functionality of the IC chip.

In some embodiments the connections are performed through software, reading through a memory-mapped register interface for connecting together the components of the IC. In such embodiments the software program further specifies how the components of the IC interrupt, and thus change the execution sequence of the software program. Also, the software program specifies how the components of the IC sequence themselves with data passing without any intervention.

FIG. 3 illustrates an exemplary connection of four functional components 31-34, each of which comprises an FB or IP block. The DONE components of FCs 31 and 33 are chained to the GO components of FCs 33 and 34 through the links 35 and 36, respectively. With this exemplary chain, the completion of FC 31 triggers the start of FC 33, which in turn, after completion, triggers the start of FC 34. Thus in effect, the chaining allows the serial processing of FCs 31, 33 and 34.

FIG. 4 illustrates an exemplary chaining methodology, comprising two FCs 41 and 42, together with a CCB (component control block) 43. The FC 41/42 comprises a FB 41A/42A and a FCB with three registers: GO component 41B/42B, GO_OFF component 41C/42C and DONE component 41D/42D, respectively. The registers 41B-41D and 42B-42D contain addresses into the CCB 43, whose registers 43A-43E hold the corresponding values.

In some embodiments the FCBs are connected to a central processing unit (CPU) for configuring or reconfiguring the addresses stored in these registers 41B-41D and 42B-42D. These addresses correspond to the registers 43A-43E in the CCB 43. The linking of FC 41 and 42, symbolically illustrated as the linkage 44 between the DONE component 41D and the GO component 42B, is performed by assigning the DONE component 41D and the GO component 42B the same address of the CCB register 43C. In essence, when FB 41A finishes processing, it sets the value of the DONE register 41D, which is stored in register 43C. Since this is precisely the value of the GO register 42B, FB 42A thus receives the start signal as soon as the FB 41A finishes. The two FCs 41 and 42 are then chained serially together.

In some embodiments the CPU sets the register 43A of the CCB to start the chain function of FCs 41 and 42. Also, the last DONE component 42D of FC 42 sets the register 43E, which is an interrupt 46 to the CPU. Thus the completion of the chain 41/42 raises an interrupt 43E, which alerts the CPU to take appropriate action.
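The chaining of FC 41 and FC 42 through a shared CCB address can be sketched as follows. The numeric addresses chosen for registers 43A, 43C and 43E, the class `FC`, and the polling loop `run_chain` are assumptions made for illustration; the sketch also assumes that the GO component of FC 41 points at register 43A and that each FC clears its GO bit when it starts.

```python
# Hypothetical CCB addresses standing in for registers 43A, 43C and 43E.
ADDR_43A, ADDR_43C, ADDR_43E = 2, 3, 4


class FC:
    """A functional component reduced to its GO and DONE addresses."""

    def __init__(self, name, go_addr, done_addr):
        self.name = name
        self.go_addr = go_addr      # where this FC reads its start value
        self.done_addr = done_addr  # where this FC writes its completion value


def run_chain(ccb, fcs, log):
    """Poll the FCs until no FC can start any more."""
    progressed = True
    while progressed:
        progressed = False
        for fc in fcs:
            if ccb[fc.go_addr] == 1:
                ccb[fc.go_addr] = 0    # consume the start signal
                log.append(fc.name)    # stand-in for the FB's processing
                ccb[fc.done_addr] = 1  # DONE write lands at the chained address
                progressed = True


ccb = [0] * 8
fc41 = FC("FC41", go_addr=ADDR_43A, done_addr=ADDR_43C)
fc42 = FC("FC42", go_addr=ADDR_43C, done_addr=ADDR_43E)  # GO shares 43C with DONE of FC41

log = []
ccb[ADDR_43A] = 1                  # the CPU starts the chain
run_chain(ccb, [fc42, fc41], log)  # polling order does not affect the result
interrupt_pending = ccb[ADDR_43E] == 1
```

The chain is defined entirely by the addresses held in each FC: pointing `done_addr` of one FC and `go_addr` of another at the same location is what links them, and re-chaining means only rewriting those addresses.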

In some embodiments the CCB is a table of 2^N bits, referred to by bit addresses CCB[0:2^N-1]. In some embodiments the CCB table is memory-mapped so the CPU is able to view it and to write it. In one aspect, CCB[0] is set to be zero and CCB[1] is 1. CCB[0] and CCB[1] can be hard-wired. A portion of the CCB table, CCB[2:2^M-1] with M<N, is reserved as interrupts to the CPU.
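The table layout described above can be modeled with a short class. This is a sketch only: the class `CCB` and its methods are hypothetical, and the choice to silently ignore writes to the two hard-wired bits is an assumption rather than documented behavior.

```python
class CCB:
    """Behavioral sketch of a 2^N-bit table with hard-wired and interrupt bits."""

    def __init__(self, n_bits, m_bits):
        if not 0 < m_bits < n_bits:
            raise ValueError("need 0 < M < N")
        self.size = 2 ** n_bits          # addresses 0 .. 2^N - 1
        self.irq_last = 2 ** m_bits - 1  # interrupts occupy 2 .. 2^M - 1
        self.bits = [0] * self.size
        self.bits[1] = 1                 # CCB[1] is constant 1; CCB[0] stays 0

    def is_interrupt(self, addr):
        return 2 <= addr <= self.irq_last

    def read(self, addr):
        return self.bits[addr]

    def write(self, addr, value):
        if addr in (0, 1):
            return                       # hard-wired bits: write ignored (assumption)
        self.bits[addr] = 1 if value else 0

    def pending_interrupts(self):
        """Addresses in the interrupt range whose bit is currently set."""
        return [a for a in range(2, self.irq_last + 1) if self.bits[a]]


table = CCB(n_bits=4, m_bits=2)  # 16 bits; addresses 2 and 3 are interrupts
table.write(0, 1)                # no effect on the constant-0 bit
table.write(3, 1)                # an FC signals completion on an interrupt bit
table.write(5, 1)                # an ordinary chaining bit, not an interrupt
pending = table.pending_interrupts()
```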

FIG. 5 shows an embodiment where the FC has two pieces: an FB, which can be any logic device or IP block, and an FCB. The blocks can have a simple memory-mapped register set and can also provide interrupts to the CPU. The FCB interacts with the CCB, in an embodiment such as this one through 12 N-bit addresses in the flow control block: 2^Q (4 shown) start addresses GO_ADDR_0, GO_ADDR_1, . . . , 2^R (4 shown) busy addresses GO_OFF_ADDR_0, GO_OFF_ADDR_1, . . . , and a plurality (1 shown) of completion/chaining addresses DONE_ADDR_0, . . . .

In some aspects, in normal operation of the component, the FC starts the component when the start condition involving a logical function for the start addresses is satisfied. For example, CCB[GO_ADDR_0[N-1:0]]==1. At the time the machine starts, the FCB sets the busy signal in the CCB to indicate the status of the functional block. For example, CCB[GO_OFF_ADDR_0[N-1:0]]==0.

In such an embodiment, at the time the machine completes, the FCB sets the completion signal in the CCB, to indicate the completion status and possibly to start the chaining process. For example, CCB[DONE_ADDR_0[N-1:0]]==1. This completion mechanism allows the CPU to chain together a series of predefined components in such a way that they run in series.

In embodiments such as the one described above, the FCB is interlinked with the CCB where the FCB carries the addresses and the CCB carries the value. In some aspects not all CCB bits connect to every FC. The connection is typically determined for a given implementation, where each CCB bit is connected to a particular FC. This prevents needless congestion for the CCB bits. In some embodiments the connection is software driven, meaning the registers of the flow control blocks are set by the CPU following the current program. This mechanism effectively performs the chaining of the various functional components, creating the necessary flow of functions residing in the FCs.

In some embodiments the system starts with the CPU initializing the connections (FIG. 6). This is possible since the flow control blocks are designed to be memory-mapped for the CPU to access. The initialization chains the FCs together in series, parallel, or in any other logical ways. The chaining is performed through the start addresses and the completion addresses. For example, FC A, at completion, starts another FC in series. In some embodiments FC A starts a plurality of other FCs in parallel. In some embodiments an FC starts after receiving the completion signal of another FC; in other embodiments it waits until receiving a plurality of completion signals, arranged in a predetermined logic. For example, FC C is chained from other FCs D and E through AND logic. This chaining determines that FC C only starts after both FCs D and E complete processing. If D completes processing before E, C is still waiting since the AND logic only permits C to start if both start signals are satisfied.
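A small step-by-step simulation can illustrate parallel fan-out and the AND join described above. Everything in it is hypothetical: the function `simulate`, the address assignments, and the extra component F, which is introduced only so that E completes one step later than D. Each FC runs once and takes exactly one step, which is a simplification.

```python
def simulate(ccb, fcs, max_steps=10):
    """Return, per step, the names of the FCs that start in that step.

    fcs maps a name to (go_addrs, mode, done_addr); mode is "AND" or "OR".
    """
    started_per_step = []
    finished = set()
    for _ in range(max_steps):
        ready = []
        for name, (go_addrs, mode, _done_addr) in fcs.items():
            if name in finished:
                continue                       # each FC runs once in this sketch
            bits = [ccb[a] == 1 for a in go_addrs]
            if all(bits) if mode == "AND" else any(bits):
                ready.append(name)
        if not ready:
            break
        for name in ready:                     # FCs ready together run in parallel
            ccb[fcs[name][2]] = 1              # completion sets the DONE bit
            finished.add(name)
        started_per_step.append(ready)
    return started_per_step


# Hypothetical addresses: 2 = CPU start, 3..7 = DONE bits of A, D, F, E, C.
ccb = [0] * 8
fcs = {
    "A": ([2], "OR", 3),
    "D": ([3], "OR", 4),       # started by A
    "F": ([3], "OR", 5),       # started by A, in parallel with D
    "E": ([5], "OR", 6),       # started by F, so E finishes after D
    "C": ([4, 6], "AND", 7),   # waits for both D and E
}
ccb[2] = 1                     # the CPU starts the chain
schedule = simulate(ccb, fcs)
```

In the resulting schedule C does not start in the step after D completes; it starts only once E's completion bit is also set.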

In some embodiments, after initialization, the CPU starts the chain process by setting the start signal in the CCB; the CPU does this by writing the CCB's memory-mapped registers. After starting the chain process, the CPU leaves it alone, and the device interacts with the CPU only through its interrupts. The interrupts signify that the chain process is completed and that it is time for the CPU to start another chain process. This mechanism significantly reduces CPU congestion, since the demand on CPU time is now only a small fraction of the processing time.

In some embodiments, as FIG. 7 illustrates, the chain process includes TDDM (time division demultiplexer), FIR, FFT and TDM FCs. The FCs are connected so that the TDDM block is chained to the FIR, then to the FFT, and then to the TDM. In some typical operations the TDDM prepares the data and turns on the FIR block. The FIR processes the data and, when completed, turns on the FFT. Once the FFT block finishes, it turns on the TDM, which on completion dumps the data into a memory and signals completion by interrupting the CPU.
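
By way of a hedged example, the serial chain of FIG. 7 can be written as a table that maps each start address to an FC and the address that FC writes on completion. The addresses 0 to 4, the table `CHAIN`, and the function `run_chain` are hypothetical; the sketch only illustrates that the CPU performs a single start write and is next involved when the last completion address, treated here as the interrupt, is reached.

```python
def run_chain(chain, start_addr, irq_addr):
    """Walk a serial chain in which each DONE address is the next GO address.

    chain maps a GO address to (fc_name, done_addr).
    Returns the order in which FCs ran and the number of CPU writes needed.
    """
    order = []
    cpu_writes = 1  # the single GO write that starts the whole chain
    addr = start_addr
    while addr != irq_addr:
        name, done_addr = chain[addr]
        order.append(name)  # the FC would process its data here
        addr = done_addr    # its DONE write starts the next FC
    return order, cpu_writes


# Hypothetical address assignment for the TDDM -> FIR -> FFT -> TDM chain.
CHAIN = {
    0: ("TDDM", 1),
    1: ("FIR", 2),
    2: ("FFT", 3),
    3: ("TDM", 4),  # address 4 is wired to the CPU interrupt in this sketch
}

order, cpu_writes = run_chain(CHAIN, start_addr=0, irq_addr=4)
```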

In some embodiments, the FCs are arranged as a series of slices where the CPU accesses all FCs and the FCs are tied to the CCB, which is a globally shared resource. The slice and CCB configuration allows for a very high level of parallelism in computation. The connection between the CCB and the FCs is logically a soft interconnection architecture which connects many devices.

In some embodiments the embedded system comprises a family of slices. Here each different slice design in the family contains a different assortment of FCs. In some embodiments library blocks are added to the selected slice to increase the functionality. In some embodiments these standard library blocks are provided independently and separately from the slices, while in others they are not.

In some embodiments each slice executes different instructions on different FCs using different data streams. Here, after each FC has completed its task, it passes the results to the next FC and waits for the next instruction. Therefore, the FCs are each synchronized to one another and are capable of passing data amongst themselves. In some embodiments, once the slice completes processing its data, it raises an interrupt to alert the CPU. Each FC has its functionality configured by software running on the CPU, and the interconnect between the FCs is also configured by the software running on the CPU. An embodiment can therefore perform many different dedicated functions by configuring and connecting the system, using only those FCs needed for its implementation.

FIG. 8 illustrates a configuration of slice arrangement for some embodiments. The chip, such as an FPGA, is partitioned into a plurality of slices 51-54, accessed through a global bus 56 and connections 57. There is a plurality of FCs in a slice, for example FCs 51A-51F in slice 51. In some embodiments an IP block occupies a whole slice, i.e., the slice 54 is an IP block. Alternately, in some embodiments an IP block, e.g. 51E, is embedded in a slice, e.g. 51. In some embodiments the IP blocks are disposed separately at optimal locations for maximum performance and density. In some embodiments IP blocks are incorporated into a slice as a functional block (FB), implemented similarly to other blocks in a design.

In some embodiments the FCs in a slice are the same. In some embodiments they are different. There is a plurality of different slice types, where slices of the same type have the same FCs. In this exemplary embodiment, slices 51 and 52 are the same type with the same FCs, slice 53 is a different type of slice, and slice 54 is an IP block. In some preferred embodiments the implementation of functional blocks within a slice and the distribution of slice types within a chip are analyzed and predetermined to service a family of applications. The contents of a slice and the types of slices in an IC are based on the family of applications. In an exemplary embodiment targeting DSP applications, a slice contains input ports, a TDDM (1 stream→N streams), an FIR, an FFT and a TDM, or a Viterbi IP block.

In some embodiments, the FCs in different types of slices are configured in various configurations. In some of these embodiments, the FCs within the same slice type are arranged in the same configuration, effectively for performing parallel processing. In some embodiments, when there are not enough slices of the same type, slices of different types are also configured in this same configuration. In some embodiments slices of the same type are configured differently to provide different functionality. The FCs and slices can be configured flexibly, limited mainly by the availability of FCs and slices.

The configuration is performed by software. After the program is loaded into the CPU, the CPU uses an initialization process to configure the FCs and the slices. This soft configurability lets a chip of the present architecture service a whole family of applications.

In some embodiments the slices have the same configuration, allowing parallel processing of the same process, similar to a SIMD computing mechanism. In some embodiments the slices have different configurations, allowing parallel processing of different processes, similar to a MIMD computing mechanism. In some embodiments the slices are chained together to provide serial processing, for example, one long chain for a SISD mechanism, and many parallel chains for SIMD or MIMD mechanisms. In some embodiments the present architecture provides massive parallelism, with virtually unlimited scalability for highly cost effective expansion.

In a SIMD (single instruction, multiple data stream) computer, all the processors simultaneously execute an identical instruction on different data sets. The main processor is tightly coupled to maintain synchronous operation of the various processors while each processor independently operates upon its data stream. In a MIMD (multiple instructions, multiple data stream) computer, the processors are decoupled and execute instructions independently of the other processors, using an instruction memory and program sequencer logic associated with each processor.

The present architecture combines SISD, SIMD and MIMD architectures. Instructions within a slice are executed sequentially. Different slices having the same configuration can all be operated from a single instruction. Different slices having different configurations can be operated from multiple instructions. In some embodiments the individual functional blocks and slices are selectively decoupled from the others to perform individual tasks, and to provide the result to the other blocks or the main processor.

In some embodiments the architecture provides for 256 slices. The exact number of slices in an embodiment depends on the particular implementation, on the expansion capability (which allows some flexibility in the underlying logic design without requiring changes to the software), and on how the designer wants to design the connections of the FCs within the slices. The connections are used to form custom circuitry such as configurable mixed-signal functions.

In some embodiments the present architecture provides large flexibility while alleviating a core problem of control congestion. FIG. 9 shows an exemplary flexible SOC architecture, comprising a CPU, a functional structure (FS) coprocessor (including slices of functional components and a component control block (CCB)), together with other peripherals including memory and communication protocol assemblies such as Ethernet or UART components. In some embodiments the FCs include digital logic that contains at least 16 bits of state and 16 simple gates of logic. Examples of FCs include FIRs, FFTs, Reed Solomon decoders, and DES encryption/decryption engines. The CCB is a logic component. Every FC communicates with the CCB. For an embodiment the designers choose which FCs to use and their associated memory sizes based on the functionality they want the system to have. The designers choose how the software interconnects these components on the same basis.

In some embodiments the present architecture reduces control congestion by reducing the requirement of CPU interactions. For example, there is a limit to a CPU's capability to service a number of slave devices. In a typical system not of this architecture, the CPU starts each slave device on its respective task, and when a slave finishes its current task, it raises an interrupt for the CPU to intervene, possibly by starting the slave again on some other task. When the number of slave devices exceeds the capability of the CPU, for example with hundreds or thousands of slave devices, the CPU is strained by servicing all these slave devices, and performance may suffer.

FIG. 10 illustrates an exemplary system configuration, including a CPU 60 controlling a plurality of slices 61-64 through a global bus 66. The slices are connected to a CCB 65, with interrupt signals 67 back to the CPU 60. The number of slave devices is reduced significantly with the slice configuration, and thus congestion to the CPU is reduced accordingly.

In some embodiments, the present architecture relieves this congestion by grouping the slave devices into slices, effectively reducing the number of slave devices that the CPU needs to service (FIG. 11). The control of the slices is passed to the CCB, so that data flows from one device, e.g., a functional block, to the next with predetermined control by the CCB and without CPU intervention. In some embodiments IP blocks are also incorporated within this scheme. In general, once a particular device finishes its operation, it informs the CCB that it is complete. Then the CCB turns on the next device in line to process that data. The CCB can also wait until multiple devices are completed before starting another device. The CCB acts autonomously, without the CPU intervening, and is therefore capable of reducing control congestion for the CPU.

In exemplary embodiments, the CCB comprises a plurality of sections, with each section covering a plurality of slices. For example, the FCs in slice 61 are connected to section 69 in the CCB 65. Also in an aspect, not all CCB bits connect to every FC in a slice. The connection is typically determined for a given implementation, where each CCB bit is connected to a particular FC. The sections can provide interrupts 67 to the CPU, together with local bus 68 for communication between the sections. In an embodiment, each CCB bit is connected to every FC. In a preferred embodiment, the FCs in each slice are connected to a section in the CCB, thus reducing interconnections between the FCs and the CCB bits. The missing connections can be covered by the local bus 68.

2. Implementation Design

The present invention further discloses implementations of functional components and slices for various families of applications. The implementation is chosen to map well to a variety of applications: it has enough power and devices to meet the needs of the application, it is well matched against the application so as to minimize surplus in die area, memory, and/or clock speeds while still serving the need at hand, and it has the right components for the application. Various metrics could be built into the slice and stored in local memory or output on debug channels. These include, but are not limited to, timestamps, throughput, memory collisions, FC timing and activity.

In some embodiments, a feedback loop is employed using software to analyze how the application fits onto the implementation (FIG. 12). An application is mapped to an existing implementation and available metrics, providing data to an analysis program. The program calculates which FCs are used and how often. If the fit is not good enough in some way, this knowledge is used to generate another implementation. For example, an FC that is often used can be duplicated, and an FC that is not used can be reduced or eliminated. The application is then mapped to the new implementation, and the feedback loop continues.
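
One step of such a feedback loop might look like the following sketch. The function `refine`, the representation of an implementation as a count of instances per FC type, the busy-fraction metric, and the 0.8 threshold are all assumptions made for illustration; the description above does not specify how "often used" is measured or how many copies are added.

```python
def refine(implementation, busy_fraction, high=0.8):
    """Produce the next implementation from usage metrics.

    implementation: dict of FC type -> number of instances.
    busy_fraction:  dict of FC type -> fraction of time busy, 0.0 to 1.0.
    high:           hypothetical threshold above which an FC is duplicated.
    """
    next_impl = {}
    for fc_type, count in implementation.items():
        busy = busy_fraction.get(fc_type, 0.0)
        if busy == 0.0:
            continue                       # never used: eliminate
        if busy > high:
            next_impl[fc_type] = count * 2  # heavily used: duplicate
        else:
            next_impl[fc_type] = count      # adequately provisioned
    return next_impl


current = {"FIR": 1, "FFT": 2, "DES": 1}
metrics = {"FIR": 0.95, "FFT": 0.40}  # DES was never exercised
proposed = refine(current, metrics)
```

The application would then be mapped onto `proposed` and measured again, repeating until the fit is acceptable.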

Over time, a library of implementations is built, and software is then used to analyze a given application's needs (FIG. 13). For example, an application is mapped to the available implementations in the library, and an analysis program recommends a particular implementation based on those needs.

3. Software Component

The system described here comprises a hardware architecture, a software architecture, a programming model, and a flow methodology.

The hardware architecture typically comprises a CPU, global memory, various analog peripherals, a global memory bus, and a plurality of slices, functional components and a component control block. In some embodiments the analog peripherals included depend on the specific application. For example, some embodiments implementing DSP functionality have A/Ds, D/As and antennas. Some embodiments implementing networking applications have SER/DES interfaces.

The present architecture is CPU-agnostic with low control congestion. Thus any microprocessor is suitable. Some embodiments have MMUs and others do not. Some embodiments that have an MMU will use it; others will not.

This system significantly reduces control congestion because the CPU does not need to get involved in detailed control of individual functional blocks, but only needs to set up the CCB, the arbiters, and the logical blocks. Once the whole engine is started, the CPU's involvement is minimal.

Ultimately the CPU controls the whole system. At any given time the CPU observes and/or controls any other given component in the system if it is programmed to do so. In some embodiments, however, the CPU delegates a significant portion of control to the CCB. Thus this architecture utilizes distributed control flow to reduce data congestion.

The present system includes a software programming model. On reset the CPU initializes various system components, such as chaining functional components and slices, using register writes. For example, the chaining sets up a string of DSP functions in a sequence. The whole design then waits for data to come in. The system components process the data with no CPU intervention. In some embodiments the system components interrupt the CPU. In some embodiments the CPU queries the system as it runs, for debugging, checking status, and dataflow analysis.

The present invention further discloses a system infrastructure, providing a means of rapidly developing a prototype for an application, a means of analyzing a prototype so that developers can easily see what can be improved, and advice to developers on the selection of a library design implementation given a set of requirements.

After a determination of the logic blocks, the infrastructure model assigns the logic blocks to memory address ranges after checking for conflicts, and generates the register definition files and the API for the other pieces of software to use.
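
The address assignment step can be sketched as follows, assuming a simple first-fit policy. The function `assign_ranges`, the base address 0x1000, the 0x100 alignment, and the notion of a list of reserved ranges are hypothetical; the actual infrastructure may use a different allocation policy and conflict check.

```python
def assign_ranges(blocks, base=0x1000, align=0x100, reserved=()):
    """Assign each logic block a half-open [start, end) address range.

    blocks:   list of (name, size_in_bytes), placed in order.
    reserved: ranges already taken, as (start, end) half-open pairs.
    A block that would overlap a reserved range is moved past it.
    """
    def round_up(value):
        return -(-value // align) * align

    assigned = {}
    cursor = round_up(base)
    for name, size in blocks:
        span = round_up(size)
        moved = True
        while moved:  # re-check after every move; ranges may be unsorted
            moved = False
            for r_start, r_end in reserved:
                if cursor < r_end and r_start < cursor + span:
                    cursor = round_up(r_end)
                    moved = True
        assigned[name] = (cursor, cursor + span)
        cursor += span
    return assigned


ranges = assign_ranges(
    [("FIR", 0x80), ("FFT", 0x180)],
    reserved=[(0x1100, 0x1200)],  # e.g. a range already used by the CCB
)
```

Here the FFT block is pushed past the reserved range rather than overlapping it.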

The designers can start with a digital design implementation (DDI) for rapid prototyping with functional descriptions. The system has a library of DDIs, together with an expert system to help the users decide which DDI in the library is appropriate for prototyping a given application. The software programming model offers a C language API for programming, with a register map showing how every register is memory-mapped. Once the user has defined the application in software on top of the DDI, the model analyzes the utilization of the DDI to determine which pieces are necessary for a final product and which are not, and emits a record of this. The digital designers use this record to help them implement the final product. The number of local memories is analyzed and excess memory is removed for the final product.

During runtime, the CCB tracks the processing time of each functional block. This information is used to turn down the clock speeds for each functional block in the final product for power optimization.
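
As one hedged illustration of how tracked processing times might drive clock selection, the sketch below assumes that a block's processing time is inversely proportional to its clock frequency and that each block has a known time budget. Both assumptions, along with the function name `scale_clock` and the example numbers, are hypothetical and are not taken from the description.

```python
def scale_clock(clock_hz, busy_us, budget_us):
    """Return the lowest clock that still meets the time budget.

    Assumes processing time scales as 1 / clock. The result is rounded up
    so the budget is not missed, and never exceeds the original clock.
    """
    needed = -(-clock_hz * busy_us // budget_us)  # ceiling division
    return min(clock_hz, needed)


# A block that needs 2 ms at 100 MHz but has 8 ms available.
slow_clock = scale_clock(100_000_000, busy_us=2000, budget_us=8000)

# A block already over budget keeps its original clock.
same_clock = scale_clock(100_000_000, busy_us=9000, budget_us=8000)
```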

FIG. 14 illustrates an exemplary hardware/software stack according to embodiments of the present system. The hardware stack comprises a device stack 70, which includes slices hardware and IP blocks 70A, a communication block such as Ethernet hardware 70B, memory mapped Ethernet devices 70C, and global memory 70D. The hardware stack further comprises a system logic 71, which includes a CPU 71B and memory bus and arbiters 71A. On top of the hardware stack is the software stack 72, which comprises a hardware abstraction layer (HAL) 72A, the communication stack which includes the Ethernet stack 72B, the IP stack 72C, the TCP/UDP stack 72D, and the stacks of SNMP, HTTP, TFTP, DHCP 72E, together with the OS stack 72F, and the application software 72G.

The software HAL sits on top of the CPU, the memory, and the hardware, which the CPU accesses as memory-mapped registers. The HAL (Hardware Abstraction Layer) provides an interface layer for higher-layer software to access the slice hardware and other IP blocks. There is also an Ethernet stack for communication, so the device is accessible over Ethernet. Finally, depending on the application in question, there may be higher level software that runs on the system.

The software further includes RDL (Register Definition Language), which is a simple language by which registers and their addresses are defined. It provides abstract names for all registers, which are memory-mapped. The input view of RDL is a file that describes each register, plus its mappings. A single description can be replicated as multiple, separately named instances. One output view is the register definition, specifying each register in the design along with its memory map address.
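
The relationship between the input view and the register definition output can be sketched as below. The RDL syntax itself is not given here, so a Python dictionary stands in for the input file; the function `expand`, the register names `GO_ADDR` and `DONE_ADDR`, their offsets, and the instance base addresses are all hypothetical.

```python
def expand(block_registers, instances):
    """Replicate one block's register description across named instances.

    block_registers: dict of register name -> byte offset within the block.
    instances:       dict of instance name -> base address.
    Returns a dict of abstract register name -> memory-mapped address.
    """
    table = {}
    for instance, base in instances.items():
        for register, offset in block_registers.items():
            table[f"{instance}_{register}"] = base + offset
    return table


# Stand-in for an RDL input view: one FIR description, two instances.
fir_registers = {"GO_ADDR": 0x0, "DONE_ADDR": 0x4}
register_map = expand(fir_registers, {"FIR0": 0x1000, "FIR1": 0x1100})
```

A generator of this kind could emit the same table as a C header for the HAL to consume.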

The HAL is a thin layer of abstraction. It allows the higher layers of software to access the registers in the slices and the IP blocks with some abstraction. It is implemented as a set of C function calls that use the HAL registers to access the functional blocks.

The software can be implemented for optimizing the connections of the FCs. By monitoring the FCs, for example through a counter in the CCB that records the usage of these units and how often they are on, the characteristics of the connections for the circuit can be determined. Thus, from a code standpoint, the CCB connections can be changed and the performance measured. Various connections can be analyzed, and the software can determine an optimized set of CCB connections for the FCs with respect to desired performance, such as low power consumption or fast response.

Some embodiments have other peripheral digital devices in the overall system besides the CPU and the FS. To include these in the interface, the register definitions for their access registers are added to the RDL. The regular memory is accessed normally by the CPU, without the need to go through the HAL.

The software architecture is OS-agnostic. However, the hard real-time nature of the applications at hand requires that the operating system be hard real-time. It is also desirable that the OS have a small memory footprint. Some examples include MicroC/OS and eCos. In some embodiments the OS runs on the CPU for control functions.

The Ethernet stack, IP stack, TCP/UDP stack, and the software above them (the SNMP stack, the HTTP stack, the DHCP stack and the TFTP stack) are a series of software modules that allow communication and are designed for testing devices. It is also useful for a device in the field to be able to communicate by this method. In some embodiments these functions are present. In others these functions are not essential and are removed for cost effectiveness.

The HAL (Hardware Abstraction Layer) is located at the bottom of the programming model. It is a thin layer of abstraction. It allows the higher layers of software to access the registers in the slices and the IP blocks with some abstraction.

The software is toolchain-agnostic. In some embodiments it uses the GNU tool suite, which includes gcc for compiling, gdb for debugging, and ancillary tools such as the BFD. When the system turns on, the OS starts a thread. This thread initializes all the components in the system: all the slices, the IP components, the CCB, etc. Once this has happened, the system is ready to run and the thread turns off. If the system requires other threads, for instance to monitor the Ethernet and to run the communications stacks, then the OS also starts those threads.

In some embodiments the system includes more software support, such as code to assign memory addresses to all the slices, IP blocks and CCB, and software to generate the HAL.

Some embodiments of this system are implemented in a machine or computer readable format, e.g., an appropriately programmed computer or a software program written in any of a variety of programming languages. The software program is written to carry out various functional operations of the present system. Moreover, a machine or computer readable format of the present invention may be embodied in a variety of program storage devices, such as a diskette, a hard disk, a CD, a DVD, a nonvolatile electronic memory, or the like. The software program, known as a simulator, may be run on a variety of devices, e.g. a CPU.

With reference to FIG. 26, an exemplary environment 300 for implementing various aspects of the invention includes a computer 301, comprising a processing unit 331, a system memory 332, and a system bus 330. The processing unit 331 can be any of various available processors, such as a single microprocessor, dual microprocessors or other multiprocessor architectures. In various embodiments the system bus 330 is of diverse types of bus structures or architectures, such as 12-bit bus, Industry Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Accelerated Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), or Small Computer Systems Interface (SCSI).

In some embodiments the system memory 332 includes volatile memory 333 and nonvolatile memory 334. Nonvolatile memory 334 refers to read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 333 refers to random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), or direct Rambus RAM (DRRAM).

Computer 301 also includes storage media 336, such as removable/nonremovable, volatile/nonvolatile disk storage, magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, memory stick, or an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). In some embodiments a removable or non-removable interface 335 is used to facilitate connection.

In some embodiments the computer system 301 further includes software to operate in environment 300, such as an operating system 311, system applications 312, program modules 313 and program data 314, which are stored either in system memory 332 or on disk storage 336. In different embodiments various operating systems or combinations of operating systems are used.

In some embodiments input devices 322 are used to enter commands or data, and include a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, sound card, digital camera, digital video camera, web camera, and the like, connected through interface ports 338. Possible interface ports 338 include a serial port, a parallel port, a game port, a universal serial bus (USB), and a 1394 bus. In some embodiments the interface ports 338 also accommodate output devices 321; a USB port, for example, can provide input to computer 301 and output information from computer 301 to an output device 321. Output adapter 339, such as a video or sound card, is provided to connect to some output devices such as monitors, speakers, and printers.

In the exemplary embodiment computer 301 operates in a networked environment with remote computers 324. The remote computers 324, shown with a memory storage device 325, can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically include many or all of the elements described relative to computer 301. In embodiments such as this, remote computer 324 connects to computer 301 through a network interface 323 and communication connection 337, with wired or wireless connections. In some embodiments network interface 323 encompasses communication networks such as local-area networks (LAN), wide area networks (WAN) or wireless connection networks. LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

As an exemplary embodiment, FIG. 27 shows a schematic block diagram of a sample computing environment 440 with which the present invention can interact. The system 440 includes a plurality of client systems 441. The system 440 also includes a plurality of servers 443. In such an embodiment the server 443 is used to employ the present invention. The system 440 includes a communication network 445 to facilitate communications between the clients 441 and the servers 443. Client data storage 442, connected to client system 441, can store information locally. Similarly, the server 443 can include server data storages 444.

4. Architecture to Address Memory Congestion

Typically, a parallel processing computer contains a plurality of processors coupled to one another by a data stream bus and an instruction bus. The processors typically share local memory via the data bus.

The global bus is adapted to support data transfer between the slices, the main processor, and the I/O controller. The global bus is configured to carry both instructions and data. Memory bus congestion occurs if a device uses the global memory bus every time it needs to read or write data. If dozens or hundreds of devices try to access the global memory bus at the same time, the bus itself becomes the bottleneck.

The present system is different. In some embodiments the present architecture has local memories interspersed throughout the IC for reducing memory congestion. When a device attempts to access a memory location in its local slice, the access goes directly to that local memory and not to the global memory bus.

Memory data access is often the bottleneck forcing long stalls on parallel processor systems, mainly due to the sharing of registers and buses. In some embodiments memory contention is significantly reduced with the present massively parallel architecture.

In some embodiments, in the IC floorplan, the memories are distributed throughout the chip area, often uniformly. The present slices and bands attempt to exploit this geographic locality. In an embodiment, the IC is built on an underlying geography, or floorplan, of a functional structure where logic is randomly spread throughout the device and memories are somewhat evenly distributed.

FIG. 15 shows an exemplary floorplan with slices and bands, using a slice local memory bus and a band local memory bus. Slices are series of functional components interspersed with local memories, and run, e.g., north to south. When a functional block inside a slice accesses a memory local to that slice, that access stays local and does not go out to the system memory bus. This minimizes traffic and thus contention on the global memory bus. Similarly, bands are logical constructs that run perpendicular to the slices, e.g., east to west. When a functional block inside a band accesses a memory local to that band, that access stays local and does not go out to the system memory bus.

FIG. 16 illustrates an exemplary configuration, showing global memory buses 81 and 82 and a plurality of memories 83A-C, 84A-C, and 85A-C. To reduce memory congestion, local memories 83A-85C are dispersed throughout the IC area, together with local memory buses 81A, 81B and 82A, 82B. Local memory buses 81A and 81B can run vertically, and are connected to global memory bus 81. Local memory buses 82A and 82B can run in another direction, for example, horizontally, and are connected to global bus 82.

Thus memories 83A, 83B and 83C can be connected through vertical slice local bus 81A without a need for the global memory bus. Similarly, memories 84A-84C and 85A-85C are connected through vertical local memory bus 81B. Further, memories 83A, 83B, 84A, 84B, 85A and 85B can be connected by horizontal band local memory bus 82A. Similarly, memories 83C, 84C, and 85C are connected by horizontal band local memory bus 82B. Thus the distributed memories are connected with a vertical slice local bus (81A or 81B) or a horizontal band local bus (82A or 82B). Only when an access is outside of the local area, for example when memory 83A needs access to memory 84C, is the global memory bus used. With proper incorporation of local memory buses, global memory bus access is significantly reduced, leading to a large reduction in memory congestion.
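
The bus selection implied by FIG. 16 can be illustrated with a small lookup. This is a sketch only: the function `route` and the preference for a slice bus over a band bus when both are available are assumptions, and memory 85B is assumed here to sit on band bus 82A alongside 85A.

```python
# Membership of each local bus, following the FIG. 16 description.
SLICE_BUSES = {
    "81A": {"83A", "83B", "83C"},
    "81B": {"84A", "84B", "84C", "85A", "85B", "85C"},
}
BAND_BUSES = {
    "82A": {"83A", "83B", "84A", "84B", "85A", "85B"},  # 85B is assumed
    "82B": {"83C", "84C", "85C"},
}


def route(src, dst):
    """Return the bus that carries a transfer between two memories.

    A shared slice local bus is tried first, then a shared band local bus;
    the global memory bus is used only when neither applies.
    """
    for buses in (SLICE_BUSES, BAND_BUSES):
        for bus, members in buses.items():
            if src in members and dst in members:
                return bus
    return "global"


same_slice = route("83A", "83C")  # both on slice bus 81A
same_band = route("83A", "84B")   # different slices, same band 82A
far_apart = route("83A", "84C")   # no shared local bus
```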

Some embodiments address memory congestion with memory arbiters, where most data traffic is through the local bus between memory arbiters. The arrangement of local memory arbiters amounts to a small local memory bus, connecting a few of the memories.

Alternatively, in some embodiments, memories can be connected through arbiters. FIG. 17 illustrates an exemplary embodiment of two devices 93A and 93B, with two local memories 92A and 92B, connected through the local arbiters 91A-91D. Arbiters 91A and 91C are memory arbiters, controlling access to the memories 92A and 92B. Arbiters 91B and 91D are device arbiters, controlling access to the devices 93A and 93B. With this configuration, device 93A can access memory 92A through the device arbiter 91B and memory arbiter 91A. Similarly, device 93A can also access memory 92B through arbiters 91B and 91C. With memories dispersed around the device, memory access is routed through the arbiters, thus relieving global memory bus congestion.

Alternately, in some embodiments, the local memory bus and arbiter configurations are combined. FIG. 18 illustrates an exemplary embodiment, showing a global memory bus 100 connecting two local memory buses 101A and 101B through two bus arbiters 102A and 102B, respectively. On each local memory bus, the device arbiters and the memory arbiters control the devices and the memories, respectively, in terms of communication with the local memory bus. With such a configuration, very local communication is made through the arbiters, local communication is made through the local memory bus, and communications outside the local area are made with the global memory bus, which can be designed to be a rare occurrence.

FIG. 19 shows another embodiment, illustrating the functional components, memories and arbiters disposed within a slice, and connected to a CCB and an outside-slice memory bus.

FIG. 20 shows another embodiment of a memory configuration for reducing memory congestion. Memories 202A and 202B are alternately connected to FCs 201A and 201B through, for example, multiplexers 203A and 203B. An exemplary operation is as follows. FC 201A runs, reading data from and writing data to memory 202A, with control signal 204 guiding the multiplexers 203A and 203B. When FC 201A completes processing, control signal 204 switches, and FC 201A now reads from and writes to memory 202B. In the meantime, FC 201B runs, reading from and writing to memory 202A. When the FCs complete processing, the control signal switches again, causing each FC to access the alternate memory. This configuration can reduce memory congestion, since no memory data needs to be transferred. The multiplexer is an exemplary embodiment, and other implementations can be used for switching memories between a plurality of FCs. Further, the above example uses two memories and two FCs, but any number of memories and FCs can be used.
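
The switching behavior of FIG. 20 can be modeled as below. This is a minimal sketch under stated assumptions: the class `MuxedMemories`, the encoding of control signal 204 as a single bit, and the initial pairing of FC 201B with memory 202B are hypothetical. What it illustrates is that the buffer FC 201A filled becomes FC 201B's buffer after the switch, with no copy.

```python
class MuxedMemories:
    """Two memories swapped between two FCs by one control bit."""

    def __init__(self):
        self.banks = {"202A": [], "202B": []}
        self.control = 0  # stands in for control signal 204

    def bank_name(self, fc):
        # control == 0: 201A -> 202A, 201B -> 202B; control == 1: swapped.
        if self.control == 0:
            first, second = "202A", "202B"
        else:
            first, second = "202B", "202A"
        return first if fc == "201A" else second

    def bank(self, fc):
        return self.banks[self.bank_name(fc)]

    def switch(self):
        self.control ^= 1  # flip both multiplexers together


mm = MuxedMemories()
mm.bank("201A").extend([1, 2, 3])  # FC 201A writes its output
produced = mm.bank("201A")

mm.switch()                        # FC 201A has completed processing
consumed = mm.bank("201B")         # FC 201B now sees the same storage
```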

FIG. 28 illustrates a general block schematic of a distribution of a memory device between two functional components, where the first functional component can write to the memory device and the second functional component can read from the memory device. The functional components preferably run in series, with the second functional component starting execution after the completion of the first functional component. With this memory arrangement, input data for the second functional component is ready immediately after the output data from the first functional component is written. Thus memory data transfer can be significantly reduced, and in this case, there is no memory data transfer.

FIG. 29 illustrates a block schematic of a chain of functional components linked to a plurality of memory devices. A first functional component writes data to a first memory, which then supplies the data to a second functional component. The second functional component writes data to a second memory, which then supplies the data to a third functional component. The functional components run in series, one after another, and the memory data is automatically ready for the next functional component after the completion of the previous functional component.
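The chain of FIG. 29 can be illustrated with a short software analogy. The function `run_chain` and the three stage functions are hypothetical stand-ins chosen for the example; the source does not specify what the functional components compute.

```python
def run_chain(stages, first_input):
    """Run stages in series; stage i reads memory i and writes memory i + 1."""
    memories = [first_input] + [None] * len(stages)
    for i, stage in enumerate(stages):
        # The output memory of one stage is the input memory of the next,
        # so no separate copy step is needed between stages.
        memories[i + 1] = stage(memories[i])
    return memories


# Hypothetical functional components (assumptions for illustration only).
def scale(samples):
    return [2 * s for s in samples]


def offset(samples):
    return [s + 1 for s in samples]


def total(samples):
    return sum(samples)


memories = run_chain([scale, offset, total], [1, 2, 3])
```

After the call, `memories[1]` holds the output of the first stage and at the same time serves as the input of the second, mirroring the shared memory placed between neighboring functional components.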

In an embodiment, the functional components are linked together by a component control block, so that the functional components can be executed in series (or in parallel, depending on the desired configuration) as shown in FIG. 30. In an aspect, the memory is preferably partitioned into a plurality of portions (two shown) to support the two functional components at the same time. For example, the first memory portion can be used to receive output from a first functional component, and the second memory portion can be used to provide input to a second functional component. FIG. 31 illustrates another embodiment where the memory is partitioned into two portions. The circuit further includes a switching component, shown as two multiplexers (MUX), to switch the portions of the memory device. In a first state, the muxes provide connections from the first/second portions of the memory to the first/second functional components. After the functional components complete processing, the circuit switches to a second state where the muxes provide connections from the first/second portions of the memory to the second/first functional components. In this embodiment, the functional components can process simultaneously without any data transfer.

FIG. 32 illustrates a configuration of a plurality of functional components connected to a plurality of memory devices through a switching matrix such as a mux matrix. Each functional component can read data from and write data to different portions of a same memory device or different memory devices, as controlled by a control signal to the mux matrix. This circuit allows various chaining configurations of the functional components, and provides the memory input and output to the chain configurations with minimum memory data transfer.

In another embodiment, there can be a plurality of memory devices instead of a plurality of portions of memory. FIG. 33 illustrates two memory devices connected to two functional components through a switching component such as a mux matrix. FIG. 33A illustrates a first state where the mux runs parallel, and FIG. 33B illustrates a second state where the mux runs crosswise to connect the devices. Similarly, there can be a plurality of functional devices and a plurality of memory devices connected through a connection block such as a mux matrix, as shown in FIG. 34.
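The parallel and crosswise states of FIGS. 33A and 33B can be modeled as a selection table. The sketch below is an assumption-laden illustration: the class `MuxMatrix`, the state names, and the rotation used to generalize "crosswise" beyond two devices are not taken from the source.

```python
class MuxMatrix:
    """Toy model of a switching matrix between FCs and memory devices."""

    def __init__(self, fcs, memories):
        self.fcs = list(fcs)
        self.memories = list(memories)
        # Start in the parallel state: FC i is connected to memory i.
        self.select = {fc: i for i, fc in enumerate(self.fcs)}

    def set_state(self, state):
        """Apply a hypothetical control signal: 'parallel' or 'cross'."""
        if state == "parallel":
            shift = 0
        elif state == "cross":
            shift = 1  # for two devices this swaps them; for more it rotates (assumption)
        else:
            raise ValueError(f"unknown state: {state}")
        count = len(self.memories)
        for i, fc in enumerate(self.fcs):
            self.select[fc] = (i + shift) % count

    def connected(self, fc):
        """Return the memory device currently connected to the given FC."""
        return self.memories[self.select[fc]]


matrix = MuxMatrix(["FC1", "FC2"], ["MEM1", "MEM2"])
parallel = (matrix.connected("FC1"), matrix.connected("FC2"))
matrix.set_state("cross")
crossed = (matrix.connected("FC1"), matrix.connected("FC2"))
```

A fuller matrix as in FIG. 34 would let the control signal choose any memory for any functional component; the fixed rotation here is only the simplest case that reproduces the two states shown.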

5. Architecture to Address Control and Memory Congestion

Some embodiments of the present system combine control congestion reduction using the slice architecture with memory congestion reduction using the local bus and arbiter configuration. The configuration comprises multiple slice sections, each comprising a series of functional components interspersed with local memories. In some aspects, the FC and the local memory each have a dedicated memory arbiter. In such an aspect, the FCs and the local memories can be positioned next to each other, so that an FC has access to the two local memories on either side of it by going through the memory arbiter for that memory. Alternatively, in some embodiments the slice contains a slice memory bus to service requests for data within a slice. The memory arbiters and the slice memory bus remove much traffic from the global memory bus, relieving data congestion and contention on the global memory bus.

In some aspects, the processing unit contains IP blocks with dedicated memories and arbiters. Here the arbiter for an IP block is connected to a plurality of slice arbiters to access data from the slices. This configuration provides local memory access, thus reducing congestion on the global memory bus. In some embodiments, an IP block memory arbiter is also connected to the global memory bus.

In some embodiments, the present processing unit contains a Component Control Block (CCB). The CCB enables the chaining of a series of predefined functional components, performing the connections between the functional blocks. After proper chaining, when a functional component or IP block finishes its operation, it uses the CCB to start the next functional component or IP block, which continues the process by processing the data outputs of the previous one. In some embodiments, a portion of the CCB includes interrupts to the CPU to request CPU assistance, for example upon the completion of a slice operation. Not all CCB bits need to connect to every FC. The circuit design and implementation determine, for a given embodiment and functionality, which CCB bits connect to which particular functional components. This design prevents needless congestion for the CCB.
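The chaining performed through the CCB can be sketched as a small simulation, in which a component that finishes writes a bit that another component reads as its start condition. The names `FunctionalComponent`, `go_bit`, `done_bit` and `run_chain_through_ccb` are hypothetical, and clearing the GO bit once a component starts is an assumption, not something the text specifies.

```python
class FunctionalComponent:
    """A component that starts on one CCB bit and reports completion on another."""

    def __init__(self, name, go_bit, done_bit):
        self.name = name
        self.go_bit = go_bit      # CCB bit read as the start (GO) condition
        self.done_bit = done_bit  # CCB bit written on completion (DONE)


def run_chain_through_ccb(ccb_size, components, start_bit):
    """Simulate chained execution; return the execution order and final CCB bits."""
    ccb = [0] * ccb_size
    ccb[start_bit] = 1  # e.g. set once by the CPU to kick off the chain
    order = []
    progressed = True
    while progressed:
        progressed = False
        for fc in components:
            if ccb[fc.go_bit]:
                ccb[fc.go_bit] = 0     # assumption: GO is consumed when the FC starts
                order.append(fc.name)  # stand-in for the actual processing
                ccb[fc.done_bit] = 1   # DONE of this FC is GO of the next
                progressed = True
    return order, ccb


# Listed out of order on purpose: the CCB bits, not the list, define the chain.
components = [
    FunctionalComponent("Viterbi", go_bit=2, done_bit=3),
    FunctionalComponent("FIR", go_bit=0, done_bit=1),
    FunctionalComponent("FFT", go_bit=1, done_bit=2),
]
order, ccb = run_chain_through_ccb(4, components, start_bit=0)
```

In this sketch the last bit is left set with no component waiting on it; in a real design such a bit could be one of the CCB bits routed to the CPU as an interrupt.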

FIG. 21 illustrates a slice configuration with memory arbiters and a local memory bus. A slice comprises a series of local memories and a series of functional components. A local memory bus 113 connects to the global memory bus 110 and passes through the slice 112 to the CCB 111. Each functional component has a dedicated memory arbiter, and each local memory has a dedicated memory arbiter. The functional components and the memory components in a slice logically alternate. Within the slice 112, functional components F are interspersed with memories M, and both are connected to the local memory bus through functional and memory arbiters A. If a functional component in a slice tries to access a logically adjacent memory component, then its arbiter routes the request directly to the memory arbiter for that memory, rather than going to the memory bus. Otherwise the memory request goes out to the slice memory bus. With this configuration, slice functionality rarely needs to access the global memory bus 110, since the majority of actions and memory accesses are contained within the slice 112. The CPU has access to all functional components and memories through the slice memory bus.
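The routing rule described for FIG. 21 can be expressed as a simple decision. In the sketch below, the function `route`, the list-based slice layout, and the component labels are assumptions introduced for illustration.

```python
def route(slice_layout, fc, memory):
    """Return the hops a memory request takes inside a slice.

    slice_layout lists the components in their logical order, with
    functional components and memories alternating.
    """
    fc_position = slice_layout.index(fc)
    memory_position = slice_layout.index(memory)
    if abs(fc_position - memory_position) == 1:
        # Logically adjacent: arbiter-to-arbiter, the bus is not used.
        return [f"arbiter({fc})", f"arbiter({memory})"]
    # Not adjacent: the request goes out to the slice memory bus.
    return [f"arbiter({fc})", "slice memory bus", f"arbiter({memory})"]


# Hypothetical slice with memories M and functional components F alternating.
layout = ["M0", "F0", "M1", "F1", "M2"]
near = route(layout, "F0", "M1")
far = route(layout, "F0", "M2")
```

Requests to memories outside the slice would continue from the slice memory bus to the global memory bus; that case is left out of the sketch.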

In another embodiment, the present processing unit comprises a CPU (central processing unit) which can monitor and control the whole system, including the CCB, the slices, the functional components and the IP blocks. The present architecture allows the CPU to supervise the components rather than independently control every component at the same time. After the CCB, the memory arbiters and the flow control blocks are set up, CPU involvement is minimal, and the CPU does not take part in the operation of the individual functional components. In an embodiment, the operation of the CPU is limited to the handling of interrupts or specific operations.

The CPU uses the memory buses, e.g., global and local, to access everything in the system. The CPU memory-maps all the registers in the system. The CPU uses this ability to initialize or reset the system and to query or set the various pieces of the system as the need arises. In some embodiments, there is some global memory in the system, depending on the needs of the application at hand. Typically very little global memory is needed. The memory can be ROM, DRAM, SRAM, flash or any combination thereof. In another aspect, the internal memories distributed throughout the slices and the other IP blocks are primarily for local use, and are not considered global memory, even though the CPU has access to them through the global memory bus.
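The idea of a CPU that reaches every register through one memory map can be sketched as follows. The class `MemoryMap`, the region names, and all addresses and sizes are hypothetical; the source does not give an address layout.

```python
class MemoryMap:
    """Toy address map through which a CPU reads, writes and resets the system."""

    def __init__(self):
        self.regions = []  # entries of (base, size, name, storage)

    def add_region(self, base, size, name):
        """Map a block of registers or local memory at a base address."""
        for other_base, other_size, other_name, _ in self.regions:
            if base < other_base + other_size and other_base < base + size:
                raise ValueError(f"{name} overlaps {other_name}")
        self.regions.append((base, size, name, [0] * size))

    def _locate(self, address):
        for base, size, _name, storage in self.regions:
            if base <= address < base + size:
                return storage, address - base
        raise KeyError(hex(address))

    def write(self, address, value):
        storage, index = self._locate(address)
        storage[index] = value

    def read(self, address):
        storage, index = self._locate(address)
        return storage[index]

    def reset(self):
        """Clear every mapped location, as a CPU might on system reset."""
        for _base, size, _name, storage in self.regions:
            storage[:] = [0] * size


system = MemoryMap()
system.add_region(0x0000, 4, "CCB bits")              # hypothetical address
system.add_region(0x1000, 16, "slice 0 local memory")  # hypothetical address
system.write(0x0001, 1)        # CPU sets a CCB bit
before_reset = system.read(0x0001)
system.reset()
after_reset = system.read(0x0001)
```

Local memories appear in the same map as the control registers, which matches the statement that the CPU can access them even though they are intended primarily for local use.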

In an exemplary embodiment, the configuration shown in FIG. 22 comprises slices, which are series of FCs interspersed with local memories, together with slice memory bus and local memory arbiter connections. When a component inside a slice accesses a physically adjacent memory, that access stays local and does not go out to the system memory bus. This minimizes traffic, and thus contention, on the global memory bus. Further, in this embodiment each slice and IP block has its own memory arbiter. A given IP block is chained to some slices through arbiter connections, providing local arbiter access instead of global memory bus access. In such an embodiment, every slice and IP block has a memory arbiter connected to the global memory bus.

FIG. 23 illustrates an embodiment where various functional components are arranged in a slice. The functional components are configured for a typical DSP application, chaining a series of functions, starting from an A/D converter, passing to a TDDM block, continuing with FIR, FFT, and IP block Viterbi, and finally to the D/A converter. The CCB controls the serial execution, with the local memory data passing successively through each neighboring functional block.

Further, in some embodiments the present architecture provides additional bandwidth through the additional band configuration shown in FIG. 24. This design provides additional bandwidth for high data flow, reducing flow congestion. For example, in some such embodiments, slices and IP blocks are connected through a band memory bus, in addition to the slice memory bus.

FIG. 25 shows an exemplary system configuration, further comprising a CPU for controlling the functional structure.

While the invention is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit the invention to the particular embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

What is claimed is:
 1. An integrated circuit comprising: a first functional component and a second functional component, each functional component comprising a plurality of logic devices for performing a function; a GO component coupled to a first functional component for storing a first electrical address of a connector memory, the GO component functioned to start the first functional component when the value of the connector memory specified by the first electrical address is set to a GO value; and a DONE component coupled to a second functional component for storing a second electrical address of a connector memory, the DONE component functioned to identify the completion of the second functional component by setting the value of the connector memory specified by the second electrical address to a DONE value; wherein by setting the first electrical address of the first functional component to be the same as the second electrical address of the second functional component, the first and second functional components are chained so that the first functional component starts after the completion of the second functional component.
 2. A circuit as in claim 1 further comprising: a storage memory device servicing the first and second functional components, wherein the storage memory comprises two memory portions, and further comprises a controller having at least two states to switch connection of the memory portions to the functional components.
 3. A circuit as in claim 2 wherein there are more than two functional components chained together, and wherein there are a plurality of storage memory devices to provide input and output to the chain of functional components.
 4. A circuit as in claim 2 wherein the controller comprises a multiplexer.
 5. A circuit as in claim 2 wherein the controller comprises a switching matrix.
 6. A circuit as in claim 2 wherein there are a plurality of functional components and a plurality of memory devices linked together by the controller to allow selective access of memory devices by a functional component.
 7. An integrated circuit comprising: a first and a second functional components, each functional component comprising: a GO component for storing a first electrical address of a connector memory, the GO component functioned to start the functional component when the value of the connector memory specified by the first electrical address is set to a GO value; a DONE component for storing a second electrical address of a connector memory, the DONE component functioned to identify the completion of the functional component by setting the value of the connector memory specified by the second electrical address to a DONE value; and wherein by setting the first or second electrical address of the first functional component to be the same as the first or second address of the second functional component, the functional components are chained so that the first functional component starts after the completion of the second functional component.
 8. A circuit as in claim 7 wherein the functional component comprises a group of devices for performing a set of logical processing.
 9. A circuit as in claim 7 wherein the functional component is selected from a group consisting of a logic module, a processor, a coprocessor, an arithmetic logic unit, a logic design having a plurality of RTL code lines.
 10. A circuit as in claim 7 wherein an electrical state of the connector memory specified by the first electrical address of the GO component starts the functional component.
 11. A circuit as in claim 7 further comprising a GO_OFF component for identifying that the functional component is busy processing.
 12. A circuit as in claim 7 further comprising a storage memory device servicing the first and second functional components, wherein the first functional component stores output to the storage memory device, and wherein the second functional component receives input from the storage memory device.
 13. A circuit as in claim 12 wherein the storage memory has a partition mechanism for simultaneously accepting writing from the first functional component and accepting reading from the second functional component.
 14. A circuit as in claim 12 wherein the storage memory has a two memory portions, and further comprising a controller having at least two states to switch connection of the portions to the functional components.
 15. A circuit as in claim 12 wherein there are a plurality of functional components and storage memory devices, and the functional components and the storage memory devices are chained together.
 16. A circuit as in claim 12 further comprising a second storage memory connecting to the functional component through a memory bus.
 17. A circuit as in claim 14 wherein the controller comprises one of a multiplexer and a switching matrix.
 18. A circuit as in claim 14 wherein there are a plurality of functional components and a plurality of memory devices linked together by the controller to allow selective access of memory devices by a functional component.